二、gitlab登陆功能 Username: root Password: 5iveL!fe 三、gitlab注册功能四、配置gitlab邮件发送默认有，需要其它配置去配置 [crayon-5727fa7f5b5db683207969/] 五、关闭与开放gitlab的注册功能使用root登陆后，在setting里面有六、gitlab新建第一个项目，配置ssh公钥，进行pull, push 公钥 https://git-scm.com/book/zh/v1/%E6%9C%8D%E5%8A%A1%E5%99%A8%E4%B8%8A%E7%9A%84-Git-%E7%94%9F%E6%88%90-SSH-%E5%85%AC%E9%92%A5 如果本来电脑里面就有公钥，要查看后再加入。查看用户名和邮箱，是否和注册时填的一致 命令：git config --list 加入ssk后新建项目，查看项目git clone git@xxx 是否生效如果没生效，先用http下载下来，然后用sourceTree提交一个文件，中间会让输入用户名密码，然后提交成功，以后在sourceTree和命令行里面不再提示用户名密码。总结到此，gitlab的搭建，注册，新项目，push 什么的，就都可用了。分配项目下的用户权限也是默认支持。

↧

Host ‘xxx.xx.xxx.xxx’ is not allowed to connect to this MySQL server

April 12, 2016, 11:43 pm

≫ Next: jackson解析json字符串

≪ Previous: CentOS6搭建gitlab，使其拥有github功能，git管理，pull,push，注册，登陆，邮件，域名绑定

添加其它root级别用户访问Mysql 描述： Host 'xxx.xx.xxx.xxx' is not allowed to connect to this MySQL server - Stack Overflow http://stackoverflow.com/questions/1559955/host-xxx-xx-xxx-xxx-is-not-allowed-to-connect-to-this-mysql-server 原因： If you cannot figure out why you get Access denied, remove from the user table all entries that have Host values containing wildcards (entries that contain '%' or '_' characters). A very common error is to insert a new entry with Host='%' and User='some_user', thinking that this allows you to specify localhost to connect from the same machine. The reason that this does not work is that the default privileges include an entry with Host='localhost' and User=''. Because that entry has a Host value 'localhost' that is more specific than '%', it is used in preference to the new entry when connecting from localhost! The correct procedure is to insert a second entry with Host='localhost' and User='some_user', or to delete the entry with Host='localhost' and User=''. After deleting the entry, remember to issue a FLUSH PRIVILEGES statement to reload the grant tables. See also Section 5.4.4, “Access Control, Stage 1: Connection Verification”. 解决： [crayon-5727fa7f5ab58633763515/]

↧

jackson解析json字符串

May 3, 2016, 12:25 am

≫ Next: SpringMVC Hibernate no session found for current thread

≪ Previous: Host ‘xxx.xx.xxx.xxx’ is not allowed to connect to this MySQL server

概述 jackson解析json例子 准备工作 基于JDK1.7,依赖Jackson框架类库： jackson-core-2.5.3.jar jackson-databind-2.5.3.jar Example 下面的例子是基于Jackson 2.x版本的树模型的Json解析。要解析的Json字符串：

[crayon-5729561f1a9c9715559706/] 示例代码：

[crayon-5729561f1a9df785402794/]

测试结果：

[crayon-5729561f1a9ee063508277/]

↧

SpringMVC Hibernate no session found for current thread

May 5, 2016, 2:55 am

≫ Next: 如何使用ArrayList进行冒泡排序

≪ Previous: jackson解析json字符串

配置了app-context.xml 中hibernate的各种设置，仍然报错，关键配置：要配置txmanager, aop等，因为这是完整配置 [crayon-572d607b5a0c8345597045/] 配置 <property name="current_session_context_class">thread</property> 也不work. tricky在于配置web.xml的org.springframework.orm.hibernate4.support.OpenSessionInViewFilter 配置方法：解决了 [crayon-572d607b5a0d8243278385/] 附件：app-context.xml [crayon-572d607b5a0de393253398/] web.xml [crayon-572d607b5a0e9683226739/]

↧

如何使用ArrayList进行冒泡排序

May 9, 2016, 8:41 pm

≫ Next: no qualifying bean of type is defined

≪ Previous: SpringMVC Hibernate no session found for current thread

[crayon-57316df90e904249452673/]

↧

no qualifying bean of type is defined

May 10, 2016, 11:36 pm

≫ Next: org.hibernate.hql.ast.QuerySyntaxException: expecting OPEN, found ‘>’ near line 1, column xxx

≪ Previous: 如何使用ArrayList进行冒泡排序

Multiple things can cause this, I didn't bother to check your entire repository, so I'm going out on a limb here. First off, you could be missing an annotation (@Service or @Component) from the implementation of com.example.my.services.user.UserService, if you're using annotations for configuration. If you're using (only) xml, you're probably missing the <bean> -definition for the UserService-implementation. If you're using annotations and the implementation is annotated correctly, check that the package where the implementation is located in is scanned (check your <context:component-scan base-package= -value).

↧

org.hibernate.hql.ast.QuerySyntaxException: expecting OPEN, found ‘>’ near line 1, column xxx

May 11, 2016, 1:53 am

≫ Next: MySQL错误“Specified key was too long; max key length is 1000 bytes”的解决办法

≪ Previous: no qualifying bean of type is defined

org.hibernate.hql.ast.QuerySyntaxException: expecting OPEN, found '>' near line 1, column 38 hibernate 对数据库的识别，原来sql 中有count , count 是数据库中关键字，所以报错

↧

MySQL错误“Specified key was too long; max key length is 1000 bytes”的解决办法

May 16, 2016, 3:52 am

≫ Next: 如何使用命令行创建utf8 utf8_general_ci 的DBCommand to create MySQL database with Character set UTF-8

≪ Previous: org.hibernate.hql.ast.QuerySyntaxException: expecting OPEN, found ‘>’ near line 1, column xxx

今天在为数据库中的某两个字段设置unique索引的时候，出现了Specified key was too long; max key length is 1000 bytes错误

经过查询才知道，是Mysql的字段设置的太长了，于是我把这两个字段的长度改了一下就好了。建立索引时，数据库计算key的长度是累加所有Index用到的字段的char长度后再按下面比例乘起来不能超过限定的key长度1000： latin1 = 1 byte = 1 character uft8 = 3 byte = 1 character gbk = 2 byte = 1 character 举例能看得更明白些，以GBK为例： CREATE UNIQUE INDEX unique_record ON reports (report_name, report_client, report_city); 其中report_name varchar(200), report_client varchar(200), report_city varchar(200) (200 + 200 +200) * 2 = 1200 > 1000，所有就会报1071错误，只要将report_city改为varchar(100)那么索引就能成功建立。如果表是UTF8字符集，那索引还是建立不了。

↧

如何使用命令行创建utf8 utf8_general_ci 的DBCommand to create MySQL database with Character set UTF-8

May 18, 2016, 1:38 am

≫ Next: No modifications are allowed to a locked ParameterMap解决

≪ Previous: MySQL错误“Specified key was too long; max key length is 1000 bytes”的解决办法

If database name contains nonalphanumeric chars use "

" to quote:

CREATE DATABASE

my-db

 CHARACTER SET utf8 COLLATE utf8_general_ci;
When using in shell script quote the quotes with "\"

mysql -p -e "CREATE DATABASE

my-db` CHARACTER SET utf8 COLLATE utf8_general_ci;" Question: utf8_general_ci 是什么？ 官方： For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_cicollation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons forutf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters. To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 11.1.8.7, “Examples of the Effect of Collation”): [crayon-573ec63b1baff143258945/] A difference between the collations is that this is true for utf8_general_ci: [crayon-573ec63b1bb08904897963/] Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order): [crayon-573ec63b1bb0e190585160/] MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German dictionary order and French, so there is no need to create special utf8collations. utf8_general_ci also is satisfactory for both German and French, except that “ß” is equal to “s”, and not to “ss”. If this is acceptable for your application, you should use utf8_general_ci because it is faster. If this is not acceptable (for example, if you require German dictionary order), use utf8_unicode_ci because it is more accurate. MySQL :: MySQL 5.7 Reference Manual :: 11.1.15.1 Unicode Character Sets http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html 其它一： utf8_general_ci是一个遗留的校对规则，不支持扩展，它仅能够在字符之间进行逐个比较。这意味着utf8_general_ci校对规则进行的比较速度很快，但是与使用utf8_unicode_ci的校对规则相比，比较正确性较差。 However：utf8_unicode_ci比较准确，utf8_general_ci速度比较快。通常情况下 utf8_general_ci的准确性就够我们用的了，在我看过很多程序源码后，发现它们大多数也用的是utf8_general_ci，所以新建数据库时一般选用utf8_general_ci就可以了 mysql中utf8_bin、utf8_general_ci、utf8_general_cs编码区别 - huanleyan的专栏 - 博客频道 - CSDN.NET http://blog.csdn.net/chenghuan1990/article/details/10078931 其它二： utf8_general_ci is a very simple — and on Unicode, very broken — collation, one that givesincorrect results on general Unicode text. What it does is:

converts to Unicode normalization form D for canonical decomposition
removes any combining characters
converts to upper case

This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:

The lowercase of “ẞ” is “β”, but the uppercase of “β” is “SS”.
There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
Letters like “ø” do not decompose to an “o” plus a diacritic, meaning that it won’t correctly sort.

There are many other subtleties.

utf8_unicode_ci uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.

utf8_unicode_ci is generally more accurate for all scripts. For example, on Cyrillic block:utf8_unicode_ci is fine for all these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.

The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci. Well, unless you want wrong answers. Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748

↧

No modifications are allowed to a locked ParameterMap解决

May 18, 2016, 1:40 am

≫ Next: HashMap循环遍历方式及其性能对比

≪ Previous: 如何使用命令行创建utf8 utf8_general_ci 的DBCommand to create MySQL database with Character set UTF-8

通常，在使用spring,struts2或者其他框架的时候，会构造一个请求对象，改变请求里头的值。这时候就应该需要仔细的查看javaee servlet方面的知识。这里讲一个我经常要做的一件事：修改request里ParameterMap的值，如果直接修改 [crayon-573ec63b1adf2471754113/] 则会出现 [crayon-573ec63b1ae00265983585/] 解决办法如下： [crayon-573ec63b1ae06365760498/]

↧

HashMap循环遍历方式及其性能对比

May 23, 2016, 10:52 pm

≫ Next: 【Redis 使用一】在项目中使用redis 并结合Spring

≪ Previous: No modifications are allowed to a locked ParameterMap解决

1. Map的四种遍历方式 下面只是简单介绍各种遍历示例(以HashMap为例)，各自优劣会在本文后面进行分析给出结论。 (1) for each map.entrySet()

Map<String, String> map = new HashMap<String, String>();

for (Entry<String, String> entry : map.entrySet()) {

entry.getKey();

entry.getValue();

}

(2) 显示调用map.entrySet()的集合迭代器

Iterator<Map.Entry<String, String>> iterator = map.entrySet().iterator();

while (iterator.hasNext()) {

Map.Entry<String, String> entry = iterator.next();

entry.getKey();

entry.getValue();

}

(3) for each map.keySet()，再调用get获取

Map<String, String> map = new HashMap<String, String>();

for (String key : map.keySet()) {

map.get(key);

}

(4) for each map.entrySet()，用临时变量保存map.entrySet()

Set<Entry<String, String>> entrySet = map.entrySet();

for (Entry<String, String> entry : entrySet) {

entry.getKey();

entry.getValue();

}

在测试前大家可以根据对HashMap的了解，想想上面四种遍历方式哪个性能更优。 2、HashMap四种遍历方式的性能测试及对比 以下是性能测试代码，会输出不同数量级大小的HashMap各种遍历方式所花费的时间。

PS：如果运行报异常in thread “main” java.lang.OutOfMemoryError: Java heap space，请将main函数里面map size的大小减小。其中getHashMaps函数会返回不同size的HashMap。 loopMapCompare函数会分别用上面的遍历方式1-4去遍历每一个map数组(包含不同大小HashMap)中的HashMap。 print开头函数为输出辅助函数，可忽略。测试环境为Windows7 32位系统 3.2G双核CPU 4G内存，Java 7，Eclipse -Xms512m -Xmx512m 最终测试结果如下：

compare loop performance of HashMap

-----------------------------------------------------------------------

map size | 10,000 | 100,000 | 1,000,000 | 2,000,000

-----------------------------------------------------------------------

for each entrySet | 2 ms | 6 ms | 36 ms | 91 ms

-----------------------------------------------------------------------

for iterator entrySet | 0 ms | 4 ms | 35 ms | 89 ms

-----------------------------------------------------------------------

for each keySet | 1 ms | 6 ms | 48 ms | 126 ms

-----------------------------------------------------------------------

for entrySet=entrySet()| 1 ms | 4 ms | 35 ms | 92 ms

-----------------------------------------------------------------------

表横向为同一遍历方式不同大小HashMap遍历的时间消耗，纵向为同一HashMap不同遍历方式遍历的时间消耗。 PS：由于首次遍历HashMap会稍微多耗时一点，for each的结果稍微有点偏差，将测试代码中的几个Type顺序调换会发现，for each entrySet耗时和for iterator entrySet接近。 3、遍历方式性能测试结果分析 (1) foreach介绍 见：ArrayList和LinkedList的几种循环遍历方式及性能对比分析中介绍。 (2) HashMap遍历方式结果分析 从上面知道for each与显示调用Iterator等价，上表的结果中可以看出除了第三种方式(for each map.keySet())，再调用get获取方式外，其他三种方式性能相当。本例还是hash值散列较好的情况，若散列算法较差，第三种方式会更加耗时。我们看看HashMap entrySet和keySet的源码

private final class KeyIterator extends HashIterator<K> {

public K next() {

return nextEntry().getKey();

}

private final class EntryIterator extends HashIterator<Map.Entry<K,V>> {

public Map.Entry<K,V> next() {

return nextEntry();

}

分别是keySet()和entrySet()返回的set的迭代器，从中我们可以看到只是返回值不同而已，父类相同，所以性能相差不多。只是第三种方式多了一步根据key get得到value的操作而已。get的时间复杂度根据hash算法而异，源码如下：

public V get(Object key) {

if (key == null)

return getForNullKey();

Entry<K,V> entry = getEntry(key);

return null == entry ? null : entry.getValue();

}

/**

* Returns the entry associated with the specified key in the

* HashMap. Returns null if the HashMap contains no mapping

* for the key.

final Entry<K,V> getEntry(Object key) {

int hash = (key == null) ? 0 : hash(key);

for (Entry<K,V> e = table[indexFor(hash, table.length)];

e != null;

e = e.next) {

Object k;

if (e.hash == hash &&

((k = e.key) == key || (key != null && key.equals(k))))

return e;

}

return null;

}

get的时间复杂度取决于for循环循环次数，即hash算法。 4、结论总结 从上面的分析来看： a. HashMap的循环，如果既需要key也需要value，直接用

Map<String, String> map = new HashMap<String, String>();

for (Entry<String, String> entry : map.entrySet()) {

entry.getKey();

entry.getValue();

}

即可，foreach简洁易懂。 b. 如果只是遍历key而无需value的话，可以直接用

Map<String, String> map = new HashMap<String, String>();

for (String key : map.keySet()) {

// key process

}

↧

【Redis 使用一】在项目中使用redis 并结合Spring

May 24, 2016, 2:32 am

≫ Next: 【Redis 使用二】使用jedis 进行CRUD增删改查

≪ Previous: HashMap循环遍历方式及其性能对比

spring app-context.xml [crayon-57480dcf0c1eb557681481/] [crayon-57480dcf0c1f5463075252/] Java code [crayon-57480dcf0c1fc222321332/] [crayon-57480dcf0c201118345953/]

↧

【Redis 使用二】使用jedis 进行CRUD增删改查

May 24, 2016, 2:33 am

≫ Next: rabbitmq堆积消息后生产速率降低的问题分析及应对措施

≪ Previous: 【Redis 使用一】在项目中使用redis 并结合Spring

[crayon-57480dcf0be0a493176163/]

↧

rabbitmq堆积消息后生产速率降低的问题分析及应对措施

May 25, 2016, 7:43 pm

≫ Next: 使用Spark和Kafka进行实时的数据流关联处理

≪ Previous: 【Redis 使用二】使用jedis 进行CRUD增删改查

问题描述：

在rabbitmq没有消费者的情况下，生产者持续向mq发消息，使得消息在mq中大量堆积，发送速率不受影响，但当有新的消费者连接上mq并开始接收消息时，生产速率大幅降低。

问题分析：

Rabbitmq的中处理队列收发逻辑的是一个有穷状态机进程，它对消息的处理流程可以概括为下图所示的流程：

当MQ既有生产者也有消费者时，该状态机的处理流程为：接收消息->持久化->发送消息->接收消息 –> … ->。在流控机制的控制下，收发速率能够保持基本一致，队列中堆积的消息数会非常低。
当没有消息者时，处理流程如图中橙色线条所示，MQ会持续接收消息并持久化直到磁盘被写满，因为没有发送逻辑，这时可以达到更高的生产速率。
当MQ中有消息堆积时，处理流程如图中绿色线条所示，MQ会持续从队列中取出堆积的消息将其发送出去，直到没有了堆积消息，或者消费者的qos被用光，或者没有消费者，或者消费者的channel被阻塞。如果一直没有满足上述4个条件之一，MQ就会持续的发送堆积消息，不去处理新来的消息，在流控机制的作用下，发送端就被阻塞了。

总结：从上述描述可以看出，消息堆积后，发送速率降低是MQ的处理流程使然，不是bug。这样的流程设计基于以下两个原因：

让堆积的消息更快的被消费掉，降低消息的时延。
MQ中堆积的消息越少，每个消息处理的平均开销就越少，可以提高整体性能，所以需要尽快将堆积消息发送出去。

应对措施：

打破发送循环条件。

设置合适的qos值，当qos值被用光，而新的ack未被mq接收时，就可以跳出发送循环，去接收新的消息。
消息者到主动block接收进程，消费者感知到接收消息的速度过快时，主动block，利用block与unblock方法调节接收速率。当接收进程被block时，mq跳出发送循环。

建立新的队列若服务器cpu资源有较多剩余，而又不需要保证消息的顺序的情况下可以通过建立新的vhost，在该vhost下创建queue，生产者将消息发送掉新的queue，消费者同时订阅新旧queue。
使用缓存

在生产者端使用缓存，当生产速率受到流控限制时，缓存数据。在堆积的消息被处理完后，生产速率恢复正常时，此时将缓存的数据发送给MQ。

更新rabbitmq版本

在新版2.8.4中，在有大量消息堆积时，生产速率会受到抑制，但生产者不会完全被阻塞。

加机器。

↧

使用Spark和Kafka进行实时的数据流关联处理

May 26, 2016, 8:18 pm

≫ Next: java中判断字符串是否为数字的方法的几种方法

≪ Previous: rabbitmq堆积消息后生产速率降低的问题分析及应对措施

This post goes over doing a few aggregations on streaming data using Spark Streaming and Kafka. We will be setting up a local environment for the purpose of the tutorial. If you have Spark and Kafka running on a cluster, you can skip the getting setup steps.

The Challenge of Stream Computations

Computations on streams can be challenging due to multiple reasons, including the size of a dataset. Certain metrics such as quantiles need to iterate over the entire dataset in a sorted order using standard formulae/ practices and they may not be the most suited approach, for example, mean = sum of value/ count. For a streaming dataset, this is not fully scalable. Instead, suppose we store the sum and count and each new item is added to the sum. For every new item, we increment the count, and whenever we need the average, we divide the sum by the count. Then we get the mean at that instance.

Calculating Percentile

Percentile requires finding the location of an item in a large dataset; for example, 90th percentile would mean the value that is over 90 percent of the values in a sorted dataset. To illustrate, in [9, 1, 8, 7, 6, 5, 2, 4, 3, 0], the 80th percentile would be 8. This means we need to sort the dataset and then find an item by its location. This clearly is not scalable. Scaling this operation involves using an algorithm called tdigest. This is a way of approximatingpercentile at scale. tdigest creates digests that create centroids at positions that are approximated at the appropriate quantiles. These digests can be added to get a complete digest that can be used to estimate the quantiles of the whole dataset. Spark allows us to do computations on partitions of data, unlike traditional Map Reduce. So we calculate the digests for every partition and add them in the reduce phase to get a complete digest. This is the only time we need to converge that data at one point (reduce operation). We then use Spark's broadcast feature to broadcast the value. This value is then used for filtering the dataset to leave us an RDD matching our criteria (top 5 percentile). We then use mapPartitions to send the values of each partition to Kafka (this could be any message handler, post, and so on).

Nature of the Data

We are using fictitious data. It contains two columns: user_id, and activity_type. We are going to compute popular users. The activity can be of the following types: profile.picture.like, profile.view, and message.private. Each of these activities will have a different score.

Metrics We Would Like to Compute

We would like to compute the most popular users, that is, the top 5 percentile of users (score and list of users).

Prerequisites

You must have Docker, Python 2.7, and JRE 1.7 installed, as well as Scala and basic familiarity with Spark and the concept of RDDs.

Getting Setup with Kafka

Download the Kafka container. For the purpose of this tutorial, we will run Kafka as a Docker container. The container can be run with Mac: [crayon-57480dcf0acaf017052086/] Linux (Docker installed directly on the machine): [crayon-57480dcf0acbd978859377/] More information about the container can be found at here. This should get you started with running a Kafka instance that we will be using for this tutorial. We also download the Kafka binaries locally to test the Kafka consumer, create topics, and so on. Kafka binaries can be found at here. Download and extract the latest version. The directory containing the Kafka binaries will be referred to as $KAFKA_HOME.

Getting Setup with Spark

The next step is to install Spark. We have two options to run spark:

Run it on a Docker container
Run it locally

Running Spark Locally

Download Spark binaries from here: 解压文件 [crayon-57480dcf0acc3780246734/] If you have IPython installed, you can also use IPython with pyspark by using the following line: [crayon-57480dcf0acc8992228456/]

在Docker容器中把Spark跑起来

[crayon-57480dcf0accd375546956/] This will mount a directory named my_code on your local system to the /app directory on the Docker container. The Spark shell starts with the Spark Context available as sc and the HiveContext available as the following: [crayon-57480dcf0acd2072951593/] Here is a simple Spark job for testing the installation: [crayon-57480dcf0acd7612296631/]

Spark Streaming Basics

Spark streaming is an extension of the core Spark API. It can be used to process high-throughput, fault-tolerant data streams. These data streams can be nested from various sources, such as ZeroMQ, Flume, Twitter, Kafka, and so on. Spark Streaming breaks the data into small batches, and these batches are then processed by Spark to generate the stream of results, again in batches. The code abstraction from this is called DStream, which represents a continuous stream of data. A DStream is a sequence of RDDs loaded incrementally. More information on Spark Streaming can be found in the Spark Streaming Programming guide.

Kafka Basics

Kafka is a publish-subscribe messaging system. It is distributed, partitioned, and replicated. Terminology: A category of feeds is called a topic; for example, weather data from two different stations could be different topics.

The publishers are called Producers.
The subscribers of these topics are called Consumers.
The Kafka cluster has one or more servers each of which is called a broker.
More details can be found at here.

Generating Mock Data

We can generate data in two ways:

Statically generated data
Continuous data generation

We can use statically generated data to generate a dataset and use that in our Kafka producers. We could use the following method to generate random data: [crayon-57480dcf0acdd759173135/] We can also generate data on the fly using this code: [crayon-57480dcf0ace3877025637/] The full source code can be found at the GitHub repo. Now we can start the producer and use the following line: [crayon-57480dcf0ace8534278351/] We can see the Kafka messages being printed to the console. At this point, we have our producer ready.

Aggregation and Processing Using Spark Streaming

This process can be broken down into the following steps:

Reading the message from the Kafka queue.
Decoding the message.
Converting the message type text to its numeric score.
Updating the score counts for incoming data.
Filtering for the most popular users.

Reading Messages from the Kafka Queue

Reading messages in pyspark is possible using the KafkaUtils module to create a stream from a Kafka queue. [crayon-57480dcf0aced155946789/] Load the message and convert the type text to key. This is done by using Python’s built-in json module and returning a tuple of the relevant values. If you notice, we used this: [crayon-57480dcf0acf3321691889/] Here, scores is a dictionary that maps the message type text to a numeric value. We then broadcast this dictionary out to all the nodes as score_b, using the following lines: [crayon-57480dcf0acf8301597309/] Next, we access the dictionary using scores_b.value, which returns us the original dictionary. Spark uses a bit torrent style broadcast, where the master broadcasts the value to a few nodes and the other nodes replicate this value from those nodes. [crayon-57480dcf0acfd915706489/] Now we count incoming messages and update the score count. For this step, we use the updateStateByKey function on the DStream. The updateStateByKey function returns a new DStream by applying the provided function to the previous state of the DStream and the new values. This function operates somewhat similarly to a reduce function. The function provided to updateStateByKey has the accumulated value from the previous operations and the new value, and we can aggregate or combine these in our function that we provide. We also have to note that the first value is used as the key by default, so in this case the userId is the key, which is ideal. The score is the value. [crayon-57480dcf0ad02358137147/] Now we can filter the most popular users. We compute the desired percentile and filter based on it. To calculate the percentile, we use the tdigest algorithm. This algorithm allows us to estimate the percentile value in a single pass and thus is very useful and efficient for streaming data. The orignal tdigest repo from Ted Dunning can be found at here. An open source Python implementation of this algorithm was used and it can be found at here. We create a digest_partitions function that takes values from a given partition and adds them to the digest. In the reduce step, these digests are added to provide a final digest that can provide us the percentile value. We then broadcast this percentile value, which we later use in our filter. We could have also performed the computation of the digest within the filter_most_popular function, but this way we can easily add some form of output such as a Kafka producer to publish the percentile value, if needed. [crayon-57480dcf0ad08744204346/] This filtered RDD can now be broadcast using Kafka. To broadcast the values, we used a Keyed producer and key on the timestamp. We use the foreachPartition function to publish each partition of the RDD instead of publishing each value at once to avoid the overhead of creating a huge number of network connections to Kafka. [crayon-57480dcf0ad0e398069172/] The complete code can be found here. The code can be run using: [crayon-57480dcf0ad13237376234/]

About the Author

Anant Asthana is a principal consultant and data scientist at Pythian. He is also an avid outdoorsman and is very passionate about open source software.

↧

java中判断字符串是否为数字的方法的几种方法

May 30, 2016, 12:35 am

≫ Next: JAVA WEB支持CORS 跨域访问

≪ Previous: 使用Spark和Kafka进行实时的数据流关联处理

1.用JAVA自带的函数

[crayon-574c18315c46a579375851/]

2.用正则表达式首先要import java.util.regex.Pattern 和 java.util.regex.Matcher

[crayon-574c18315c473030305837/]

3.使用org.apache.commons.lang org.apache.commons.lang.StringUtils; boolean isNunicodeDigits=StringUtils.isNumeric("aaa123456789"); http://jakarta.apache.org/commons/lang/api-release/index.html下面的解释:

[crayon-574c18315c479034421003/]

上面三种方式中，第二种方式比较灵活。第一、三种方式只能校验不含负号“-”的数字，即输入一个负数-199，输出结果将是false；而第二方式则可以通过修改正则表达式实现校验负数，将正则表达式修改为“^-?[0-9]+”即可，修改为“-?[0-9]+.?[0-9]+”即可匹配所有数字。

↧

JAVA WEB支持CORS 跨域访问

May 30, 2016, 12:51 am

≫ Next: mysqld: unknown variable ‘default-character-set=utf8’

≪ Previous: java中判断字符串是否为数字的方法的几种方法

平时在web项目开发中会经常遇到一些跨域操作，但由于由于安全限制(同源策略，即JavaScript或Cookie只能访问同域下的内容)，会造成Origin null is not allowed by Access-Control-Allow-Origin”错误。目前经常用于跨域操作的两个解决方案： JSONP和CORS（Cross Origin Resource Sharing ）

问题原因

这是由于浏览器的同源策略限制的缘故，简单来说，从HTML中发出XMLHttpRequest 请求时，Browser会做检查，如果发现Response中没有Access-Control-Allow-Origin Header或Access-Control-Allow-Origin Header Header的值与 HTML的 orgin 不同时，Browser会拒接绝该Response，Javascript就收不到该Response。本地HTML的Origin是 null, 而Server端没有发出Access-Control-Allow-Origin Header Header给Browser, 所以会有了“Origin null is not allowed by Access-Control-Allow-Origin”错误。

JSONP和CORS比较

JSONP 只能用于GET请求，并且有一定的安全隐患，因为JSONP的实现机制实际上类似于注入脚本
CORS Cross Origin Resource Sharing (CORS) W3C标准，专门用来解决跨域问题。支持各种形式的请求。由于这是比较新的标准，旧的浏览器会不支持。

CORS 原理

CORS 定义了一套Access-Control Header控制的标准, 定义以下头部信息，来控制是否允许跨域访问，什么样的请求能够跨域 [crayon-574c18315ba4c518723155/]

CORS 浏览器支持

目前主流的浏览器大多都支持CORS，旧版的 IE7不支持。

Tomcat 下配置 CORS

Apache Tomcat 从7.0 版本之后才支持CORS，以下配置示例 [crayon-574c18315ba59732544796/] 详细的配置参数请参见： http://tomcat.apache.org/tomcat-7.0-doc/config/filter.html#CORS_Filter 相关文档： http://enable-cors.org/ https://www.w3.org/TR/cors/

↧

mysqld: unknown variable ‘default-character-set=utf8’

June 3, 2016, 8:19 am

≫ Next: 构建Java 高性能编程

≪ Previous: JAVA WEB支持CORS 跨域访问

今天升级mysql的时候，DB一直起不来 [crayon-575fec2b8c812645262614/] 查日志如下 [crayon-575fec2b8c824288382817/] 看到第一个Error，原来是这个问题 mysql - 'unknown variable "character-set-server=utf-8"' error at mysqldump - Stack Overflow http://stackoverflow.com/questions/26135583/unknown-variable-character-set-server-utf-8-error-at-mysqldump 在my.cnf里面查看, 原来是文件中mysqld和client里面都是配置的 [crayon-575fec2b8c82d312230470/] 就不行，去掉client里的这行也不行，又看了一个日文boy写的 fedora15 mysql5.5 default-character-setが原因で起動できない - ITとともに生きよう http://d.hatena.ne.jp/nightmare_tim/20110530/1306704112 好了，分别配置即可，如下： [crayon-575fec2b8c833631694932/]

↧

MOOC课程

书籍

相关博文和教程

相关深度学习框架

相关组织

问题描述：

问题分析：

应对措施：

The Challenge of Stream Computations

Calculating Percentile

Nature of the Data

Metrics We Would Like to Compute

Prerequisites

Getting Setup with Kafka

Getting Setup with Spark

Running Spark Locally

Spark Streaming Basics

Kafka Basics

Generating Mock Data

Aggregation and Processing Using Spark Streaming

Reading Messages from the Kafka Queue

About the Author

问题原因

JSONP和CORS比较

CORS 原理

CORS 浏览器支持

Tomcat 下配置 CORS