Preface: I have worked in operations and maintenance (ops) for three and a half years and have run into all kinds of problems along the way: data loss, website crashes, accidentally deleted database files, hacker attacks, and more.

Today I will organize these lessons briefly and share them with you.

1. Online Operation Specifications

1. Test before use

When I first learned Linux, everything from the basics to services to clusters was done in virtual machines. Our teacher told us a VM was no different from a real machine, yet our longing for a real environment grew by the day. Meanwhile, VM snapshots let us develop all kinds of careless habits, so when we finally got permissions on a real server we could not wait to try things out. I remember my first day at work: the boss gave me the root password. Only PuTTY was available and I wanted to use Xshell, so I quietly logged in to the server and tried to switch it to Xshell with key-based login. I did not test the change and did not leave a spare ssh session open, so after I restarted the sshd service I was locked out of the server. Fortunately I had backed up the sshd_config file beforehand, and the data center staff later copied it back with cp. It was also lucky that this was a small company; anywhere else I would have been fired on the spot.
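The lockout above is the kind of failure that validating a new configuration before installing it can catch. The following is a minimal sketch, not a definitive tool: it assumes a validator command that exits non-zero on a bad file (for sshd that would typically be `sshd -t -f <file>`), and the function name `install_config` is hypothetical.

```python
import os
import shutil
import subprocess
import tempfile


def install_config(path, new_text, validate_cmd):
    """Write new_text to path only if validate_cmd accepts it.

    validate_cmd is a list of arguments; the candidate file path is appended.
    Returns True if the new config was installed, False if it was rejected.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, candidate = tempfile.mkstemp(dir=directory, suffix=".candidate")
    with os.fdopen(fd, "w") as handle:
        handle.write(new_text)

    # Validate the candidate file, never the live one.
    result = subprocess.run(validate_cmd + [candidate])
    if result.returncode != 0:
        os.remove(candidate)
        return False

    if os.path.exists(path):
        # Keep a copy of the working config and preserve its permissions.
        shutil.copy2(path, path + ".bak")
        shutil.copymode(path, candidate)
    os.replace(candidate, path)
    return True
```

Even with a check like this, keeping a second ssh session open until a fresh login succeeds remains the simplest safety net.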

The second example is about file synchronization. Everyone knows rsync synchronizes quickly, but it can also delete files much faster than rm -rf. rsync has an option that makes one directory mirror another, deleting whatever the reference directory does not contain; if the reference directory is empty, you can imagine the result: the directory that actually holds the data is wiped. Through a careless mistake and a lack of testing, I wrote the two directories in the wrong order, and the crucial point is that there was no backup. Production data deleted with no backup: you can work out the consequences for yourself. The importance of testing first is self-evident.
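A reversed sync of this kind has a recognizable shape: the source is empty while the destination holds data. The sketch below checks for that shape before any sync is started. The function names are hypothetical, and the check only looks at top-level entries, so it is a rough guard rather than a full safeguard.

```python
import os


def count_entries(directory):
    """Return how many entries a directory contains (top level only)."""
    return len(os.listdir(directory))


def check_sync_direction(source, destination):
    """Raise ValueError if a mirroring sync looks like it is reversed.

    An empty source with a populated destination is the pattern that
    wipes data when the sync also deletes extra files.
    """
    src_count = count_entries(source)
    dst_count = count_entries(destination)
    if src_count == 0 and dst_count > 0:
        raise ValueError(
            f"refusing to sync: source {source!r} is empty but "
            f"destination {destination!r} holds {dst_count} entries"
        )
    return src_count, dst_count
```

Running rsync with --dry-run first, and reading what it reports it would delete, serves the same purpose.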

2. Confirm again before pressing Enter

Mistakes like rm -rf /var are quite likely if you type fast or the network is lagging. By the time you notice that the command has already run, your heart has gone at least half cold. You may say, "I have pressed Enter so many times without a mistake, there is nothing to fear." All I can say is that you will understand once it happens to you. Do not assume ops accidents only happen to other people; if you are not careful, the next one will be yours.
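One way to force a second look is to have a wrapper script refuse protected paths and require the operator to retype the target. This is a small sketch under assumptions: the `PROTECTED` list and both function names are hypothetical, and POSIX paths are assumed.

```python
import os

# Hypothetical list of paths that should never be removed recursively.
PROTECTED = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/usr", "/var"}


def normalize(path):
    """Return an absolute path without trailing slashes or '..' parts."""
    normalized = os.path.normpath(os.path.abspath(path))
    # POSIX lets a leading "//" survive normalization; collapse it to "/".
    return "/" + normalized.lstrip("/")


def is_protected(path):
    """Return True if path resolves to one of the protected directories."""
    return normalize(path) in PROTECTED


def confirm_delete(path, typed_confirmation):
    """Allow deletion only if the operator retyped the exact normalized path."""
    if is_protected(path):
        return False
    return typed_confirmation == normalize(path)
```

Retyping the path takes a few seconds, which is usually enough time to notice a stray space or a missing directory level.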

3. Do not let several people operate at once

At my last company, ops management was quite chaotic. A typical example: several ops staff who had already left the company still had the servers' root passwords.

Normally, when we ops staff receive a task, we do a quick check first and ask others for help if we cannot solve it. But when a problem gets out of hand, the customer service supervisor (who knows some Linux), the network administrator, and your boss all end up debugging the same server together. After comparing things in various ways, you discover that the server's configuration file differs from the version you last modified, so you change it back and go googling. You excitedly find the problem and fix it, only for someone else to tell you that he has fixed it too, by changing a different parameter. At that point you really cannot tell which change addressed the real cause. Still, that is the good outcome: the problem is solved and everyone is happy. But what about when you modify a file, test it, find the change has no effect, and go back only to discover that the file has been modified again? It is infuriating. Do not let multiple people operate on the same server at once.

4. Back up first, then operate

Make it a habit: before you modify any data, such as a .conf configuration file, back it up first. When editing a configuration file, it is also best to comment out the original options, then copy them and modify the copies. And if there had been a backup in the rsync example above, the mistake would have been harmless. Losing a database is never the work of a single moment; with even one backup, things would not have been so miserable.
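The backup habit is easier to keep when it is a single call. A minimal sketch, assuming a timestamped copy next to the original is good enough; the function name `backup_file` and the naming scheme are illustrative choices, not a standard.

```python
import os
import shutil
import time


def backup_file(path, now=None):
    """Copy path to path.YYYYmmdd-HHMMSS.bak and return the backup path.

    now is an optional Unix timestamp; the current time is used if omitted.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    backup_path = f"{path}.{stamp}.bak"
    # copy2 also preserves modification time and permission bits.
    shutil.copy2(path, backup_path)
    return backup_path
```

The timestamp in the name means repeated edits on the same day do not overwrite each other's backups.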

2. Data

1. Use rm -rf with caution

The Internet is full of examples: rm -rf / in every variety, deleted primary databases, ops accidents of all kinds. One small mistake can cause enormous losses. If you really do need to delete something, be careful.

2. Backups above all else

Backups already came up several times above, but I want to file them under data as well. Once again: backups are very important. I remember my teacher saying that you can never be too careful where data is concerned. My company runs a third-party payment website and an online lending platform. The payment site gets a full backup every two hours, and the lending platform is backed up every 20 minutes. I will not say more; judge for yourself.
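A backup schedule only helps if someone notices when it stops running. The sketch below checks whether the newest backup is older than its interval, using the two intervals mentioned above; the constant and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Intervals taken from the text: every two hours and every 20 minutes.
PAYMENT_INTERVAL = timedelta(hours=2)
LOAN_INTERVAL = timedelta(minutes=20)


def backup_is_stale(last_backup, now, max_age):
    """Return True if the newest backup is older than the allowed interval."""
    return now - last_backup > max_age
```

For example, with `now` at 12:00, a payment backup taken at 09:00 is stale, while one taken at 11:00 is not.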

3. Stability above all else

In fact this applies not just to data but to the whole server environment: stability matters more than anything else. Aim for the most stable and available setup, not the fastest. So do not put new software on a server before it has been tested. Take nginx + php-fpm as an example: in our production environment PHP hung in all sorts of ways and was fine again after a restart, or we simply switched to apache.

4. Confidentiality above all else

Leaked private photos and router backdoors are in the news all the time these days, so where data is involved, keeping it confidential is a must.

3. Security

1. ssh

  • Change the default port (of course, a professional who wants to hack you will still find it with a scan)
  • Disable root login
  • Use an ordinary user + key authentication + sudo rules + IP address restrictions + user restrictions
  • Use anti-brute-force software such as DenyHosts (block an address outright after a few failed attempts)
  • Restrict which of the users listed in /etc/passwd may log in
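Several points in the list above map to sshd_config directives (Port, PermitRootLogin, PasswordAuthentication, AllowUsers), so they can be checked mechanically. The following is a rough sketch, assuming whitespace-separated directives and ignoring Match blocks and included files; the function names and the wording of the findings are hypothetical.

```python
def parse_sshd_config(text):
    """Return a dict of lowercase directive -> first value seen."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd uses the first value it obtains for each keyword.
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def audit(text):
    """Return a list of findings for the hardening points in the list."""
    settings = parse_sshd_config(text)
    findings = []
    if settings.get("port", "22") == "22":
        findings.append("default port 22 is in use")
    if settings.get("permitrootlogin", "").lower() != "no":
        findings.append("root login is not disabled")
    if settings.get("passwordauthentication", "").lower() != "no":
        findings.append("password authentication is not disabled")
    if "allowusers" not in settings:
        findings.append("no AllowUsers restriction")
    return findings
```

An empty list of findings only means these four settings look as intended; it says nothing about sudo rules or address-based restrictions, which live elsewhere.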

2. Firewall

In production the firewall must be turned on, and it must follow the principle of least privilege: drop everything by default, then open only the ports of the services you need.

3. Fine-grained permissions and control granularity

Any service that can be started by an ordinary user must not run as root. Keep each service's permissions to the minimum and control them at a fine granularity.

4. Intrusion Detection and Log Monitoring

Use third-party software to continuously detect changes to key system files and service configuration files, such as /etc/passwd, /etc/my.cnf, and /etc/httpd/conf/httpd.conf. Use a centralized log monitoring system to watch /var/log/secure, /var/log/messages, ftp upload and download records, and other logs that carry alarms and errors. For port scans you can also use third-party software that adds the scanning host straight to hosts.deny. All of this information is very helpful for troubleshooting after a system has been compromised.

It has been said that what a company invests in security is directly proportional to what it has lost to attacks. Security is a big topic, and also very basic work. If the basics are done well, system security improves considerably; the rest is for security experts.
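The core of file change detection is simple: record a digest of each watched file, then compare later. This sketch illustrates the idea with SHA-256; it is not a replacement for a dedicated tool, and the function names are hypothetical.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Return a dict mapping each path to its current digest."""
    return {path: file_digest(path) for path in paths}


def changed_files(baseline, current):
    """Return sorted paths whose digest differs, or that were added or removed."""
    all_paths = set(baseline) | set(current)
    return sorted(p for p in all_paths if baseline.get(p) != current.get(p))
```

The baseline has to be stored somewhere an intruder cannot rewrite it, otherwise the comparison proves nothing.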

4. Daily monitoring

1. System operation monitoring

Many people start their ops careers in monitoring, and large companies generally have dedicated staff monitoring around the clock. System monitoring generally covers hardware usage, most commonly memory, disk, CPU, network card, and the OS, and also includes login monitoring and monitoring of key system files. Regular monitoring lets you anticipate hardware failures and gives very practical input for tuning.
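As one small example of a hardware usage check, the sketch below reports disk usage for a path and flags it when it crosses a threshold. The 90 percent default is an arbitrary illustrative value, and the function names are hypothetical.

```python
import shutil


def percent_used(used, total):
    """Return usage as a percentage, or 0.0 for a zero-sized filesystem."""
    return 100.0 * used / total if total else 0.0


def disk_alert(path, threshold_percent=90.0):
    """Return (alert, percent) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    percent = percent_used(usage.used, usage.total)
    return percent >= threshold_percent, percent
```

A real monitoring system records these values over time, since the trend is often more useful than any single reading.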

2. Service operation monitoring

Service monitoring generally covers the various applications, such as web, db, and lvs. It tracks a set of indicators for each so that performance bottlenecks in the system can be found and resolved quickly.

3. Log monitoring

This log monitoring is similar to the security log monitoring above, but it mainly covers error and alarm messages from hardware, the OS, and applications. It is easy to neglect while the system is running stably, but if a problem occurs and you have no monitoring in place, you are left reacting passively.

5. Performance tuning

1. In-depth understanding of the operating mechanism

To be honest, with just over a year of ops experience, anything I say about tuning is armchair theory; I only want to summarize briefly here, and I will update this when I understand more. Before optimizing a piece of software, you need a deep understanding of how it works. Take nginx and apache: everyone says nginx is fast, so you must know why it is fast, what principle it relies on, and why it handles requests better than apache, and you must be able to explain this to others in plain, simple words. If necessary, you must be able to read the source code. Otherwise, any document that treats parameters as the object of tuning is nonsense.

2. A tuning framework and order of priority

Once you are familiar with the underlying mechanism, you need a tuning framework and a tuning order. For example, when a database bottleneck appears, many people go straight to the database configuration file. My suggestion is to analyze the bottleneck first, check the logs, and write down the direction the tuning should take before you start. Tuning the database server itself should be the last step; hardware and the operating system come first. Today's database servers are released only after extensive testing and are built to run on all operating systems, so they should not be the place you start.

3. Only adjust one parameter at a time

Adjust only one parameter at a time. I think everyone understands why: change too many at once and you will confuse yourself.

4. Benchmarking

To judge whether tuning has helped, or to test the stability and performance of a new software version, you need to benchmark. Benchmarking involves many factors, and whether a test comes close to the real needs of the business depends on the tester's experience. For reference, see the third edition of "High Performance MySQL".

My teacher once said that there is no one-size-fits-all parameter: every parameter change and every piece of tuning must fit the business scenario. So stop googling for tuning recipes; they bring no lasting improvement to your own skills or to your business environment.
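Changing one parameter at a time and benchmarking against a baseline can be sketched as follows. This is only an illustration of the workflow: `baseline_func` and `tuned_func` are hypothetical callables that exercise the workload before and after a single change, and a real benchmark needs a workload that resembles production traffic.

```python
import statistics
import time


def benchmark(func, repeats=5):
    """Run func repeats times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def compare(baseline_func, tuned_func, repeats=5):
    """Return (baseline_median, tuned_median) in seconds."""
    baseline = statistics.median(benchmark(baseline_func, repeats))
    tuned = statistics.median(benchmark(tuned_func, repeats))
    return baseline, tuned
```

The median is used here because a single slow run, caused for example by unrelated load on the machine, distorts it less than it would distort the mean.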

6. Operation and maintenance mentality

1. Control your mindset

Many an rm -rf /data happens in the last few minutes before leaving work, when irritability peaks. So are you really not going to get your mindset under control? Someone will say you still have to work when you are irritable, but you can at least try to avoid handling critical data in that state. The higher the pressure, the calmer you need to be, or you will lose even more.

Most people have lived through an rm -rf /data/mysql. You can imagine the feeling right after the deletion, but if there is no backup, what use is panicking? In that situation you have to calm down and plan for the worst case. With mysql, even when the physical files are deleted, some tables still exist in memory. So cut off the business traffic but do not shut down the mysql database, which helps recovery a great deal, and use dd to copy the disk, and then attempt the recovery. Of course, most of the time you can only turn to a data recovery company.

Just imagine the alternative: the data is deleted, you try all kinds of operations, shut down the database, and then attempt a repair. Not only may the files get overwritten, the tables in memory are gone as well.

2. Be responsible for the data

The production environment is not child's play, and neither is the database. You must take responsibility for the data; the consequences of not backing up are very serious.

3. Get to the bottom of things

Many ops staff are so busy that once a problem is solved they never look at it again. I remember that last year a customer's website could not be opened. The PHP code reported an error, and the cause turned out to be that the session and whos_online tables were damaged. The previous ops engineer had fixed this with repair, and so did I, but the damage came back a few hours later. After three or four rounds of this I went googling for reasons a database table might be damaged for no apparent reason: first, a myisam bug; second, a mysql bug; third, mysql being killed in the middle of a write. In the end it turned out that memory was insufficient, so the OOM killer killed the mysqld process, and there was no swap partition, even though the background monitoring showed memory as sufficient. Upgrading the physical memory finally solved the problem.
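A root cause like this one leaves traces in the kernel log, so a scan for OOM-killer entries can shorten the search. The sketch below assumes log lines that contain "Killed process <pid> (<name>)"; the exact wording varies between kernel versions, and the function name is hypothetical.

```python
import re

# Matches kernel lines such as "... Killed process 1234 (mysqld) ...".
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return a list of (pid, process_name) for OOM-killer entries."""
    kills = []
    for line in lines:
        match = KILLED.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feeding it the lines of the system log (for example /var/log/messages) shows at a glance whether mysqld was among the victims.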

4. Test and Production Environments

Before any important operation, be sure to check which machine you are on, and try to avoid keeping many windows open.
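Scripts can make the same check before they do anything destructive. A minimal sketch, assuming the hostname is enough to tell test machines from production ones; the function name `ensure_host` and the hostnames in the usage note are hypothetical.

```python
import socket


def ensure_host(expected, actual=None):
    """Raise RuntimeError unless the current hostname matches expected.

    actual can be passed explicitly for testing; by default the local
    hostname is looked up.
    """
    actual = socket.gethostname() if actual is None else actual
    if actual != expected:
        raise RuntimeError(
            f"expected to run on {expected!r}, but this is {actual!r}"
        )
    return actual
```

Calling `ensure_host("test-db-01")` at the top of a cleanup script makes it stop immediately if it is started in the wrong window.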

Reprinted: https://zhuanlan.zhihu.com/p/365519427